A General Practical Approach to Pattern Matching over Ziv-Lempel Compressed Text

Authors

  • Gonzalo Navarro
  • Mathieu Raffinot
Abstract

We address the problem of string matching on Lempel-Ziv compressed text. The goal is to search for a pattern in a text without uncompressing it. This is a highly relevant issue, since it is essential to have compressed text databases where efficient searching is still possible. We develop a general technique for string matching when the text comes as a sequence of blocks, which abstracts the essential features of Lempel-Ziv compression. We then apply the scheme to each particular type of compression. We present the first algorithm to find all the matches of a pattern in a text compressed using LZ77. When we apply our scheme to LZ78, we obtain a much more efficient search algorithm, which is faster than uncompressing the text and then searching it. Finally, we propose a new hybrid compression scheme that lies between LZ77 and LZ78, being in practice as good at compressing as LZ77 and as fast to search as LZ78. We also show how to search for some extended patterns on Lempel-Ziv compressed text, such as classes of characters and approximate string matching.
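To make the "sequence of blocks" view concrete, here is a minimal sketch of a textbook LZ78 parse, in which every block is an earlier block extended by one character. This is not the authors' search algorithm; the function names `lz78_parse` and `lz78_expand` and the `(ref, ch)` block layout are assumptions chosen for illustration.

```python
def lz78_parse(text):
    """Split text into LZ78 blocks (ref, ch), meaning block = block[ref] + ch.

    Block 0 is the implicit empty block. Illustrative layout only.
    """
    trie = {}      # (parent_block_id, char) -> block_id
    blocks = []    # blocks[i] describes block number i + 1
    cur = 0        # id of the longest known block matched so far
    for ch in text:
        nxt = trie.get((cur, ch))
        if nxt is not None:
            cur = nxt                      # keep extending the current match
        else:
            blocks.append((cur, ch))       # new block: known block + one char
            trie[(cur, ch)] = len(blocks)
            cur = 0
    if cur != 0:
        blocks.append((cur, ""))           # text ended inside a known block
    return blocks


def lz78_expand(blocks):
    """Rebuild the text by expanding each block from the one it references."""
    strings = [""]                         # strings[0] is the empty block
    for ref, ch in blocks:
        strings.append(strings[ref] + ch)
    return "".join(strings[1:])


blocks = lz78_parse("abababa")
print(blocks)                # [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a')]
print(lz78_expand(blocks))   # abababa
```

Because each block refers only to one earlier block, any information computed for a block (for example, which pattern substrings it contains) can in principle be derived from the referenced block plus one character, which is the property a block-based search would exploit.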


Similar resources

A General Practical Approach to Pattern Matching over Ziv-Lempel Compressed

We address the problem of string matching on Ziv-Lempel compressed text. The goal is to search for a pattern in a text without uncompressing it. This is a highly relevant issue for keeping compressed text databases where efficient searching is still possible. We develop a general technique for string matching when the text comes as a sequence of blocks. This abstracts the essential features of Ziv-Lempe...


Byte pair encoding : a text compression scheme that accelerates pattern matching

Byte pair encoding (BPE) is a simple universal text compression scheme. Decompression is very fast and requires small work space. Moreover, it is easy to decompress an arbitrary part of the original text. However, it has not been so popular since the compression is rather slow and the compression ratio is not as good as other methods such as Lempel-Ziv type compression. In this paper, we bring ...


A Unifying Framework for Compressed Pattern Matching

We introduce a general framework which is suitable to capture an essence of compressed pattern matching according to various dictionary based compressions. The goal is to find all occurrences of a pattern in a text without decompression, which is one of the most active topics in string matching. Our framework includes such compression methods as Lempel-Ziv family, (LZ77, LZSS, LZ78, LZW), byte-...


Approximate String Matching over Ziv-Lempel Compressed

We present a solution to the problem of performing approximate pattern matching on compressed text. The format we choose is the Ziv-Lempel family, specifically the LZ78 and LZW variants. Given a text of length u compressed into length n, and a pattern of length m, we report all the R occurrences of the pattern in the text allowing up to k insertions, deletions and substitutions, in O(mkn + R) ti...




Publication date: 1999